
fix: hostname port issue #4816

Merged
merged 4 commits into from
Feb 2, 2023

Conversation

LanceEa
Contributor

@LanceEa LanceEa commented Jan 26, 2023

Description

Note - this PR addresses the broken behavior; it does not address the developer experience (DX) around Host.hostname and ports. Improving the DX will take more thought and potentially breaking changes, so the goal of this PR is to restore the behavior users had in v1.Y and to fix the broken Envoy configuration.

When trying to expose a Host with a hostname that includes a non-standard port, such as 'example.com:8500', over a TLS connection, a client would receive 404s with the NR (no route) flag because Envoy could not find a valid route.

For example:

apiVersion: getambassador.io/v3alpha1
kind: Host
metadata:
  name: basic-host
  namespace: default
spec:
  hostname: 'example.com:8500'
  tlsSecret:
    name: tls-cert
  tlsContext: 
    name: my-tls-context
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: quote-backend
  namespace: default
spec:
  hostname: "example.com:8500"
  prefix: /backend/
  service: quote

In v1.14, Emissary would group everything under a single FilterChain with a wildcard virtual_host ("*") and then do :authority matching using header matches on each Route. This allowed the above configuration to work, but had the downside of large memory usage and slower route matching, because all routes were lumped into a single virtual host.
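The cost of the v1.14-style approach can be sketched as follows. This is a hypothetical illustration (not the actual Emissary code): with a single wildcard virtual host, every route carries its own :authority header match, so route selection is a linear scan over all routes in the mesh.

```python
# Hypothetical sketch of v1.14-style routing: one wildcard virtual host,
# routes distinguished only by an :authority header match. Not the actual
# emissary/Envoy code -- just an illustration of why it scales poorly.

def pick_route(routes, authority, path):
    """Linear scan over ALL routes -- the source of the memory/latency cost."""
    for route in routes:
        if path.startswith(route["prefix"]) and route["authority"] == authority:
            return route
    return None

routes = [
    {"prefix": "/backend/", "authority": "example.com:8500", "service": "quote"},
    {"prefix": "/other/", "authority": "other.com", "service": "other"},
]

assert pick_route(routes, "example.com:8500", "/backend/api")["service"] == "quote"
assert pick_route(routes, "example.com", "/backend/api") is None
```

Because the :authority header carries the port when one is present, this scheme matched `example.com:8500` fine; the per-Host FilterChain design below is what broke it.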

In v2.Y/v3.Y, this was addressed by creating separate FilterChains for each Host. Non-TLS Hosts get a single FilterChain containing multiple virtual_hosts (one per Host). A Host with TLS produces a 1-1 FilterChain and virtual_host. This works fine in most cases where downstream clients connect over standard ports (80/443), because the DNS portion of the :authority header matches virtual_host.domain. But when a client needs to connect on something like the example above, this effectively generates a FilterChain that can never match an incoming request, due to the overloaded way Host.hostname is used.

Before the fix (pseudo Envoy config):

FilterChains:
 - FilterChainMatching:
    server_names: ["example.com:8500"]
    transport_protocol: tls
   FilterChain:
    virtual_hosts:
      - domains: ["example.com:8500"]
        routes:
          - prefix: /backend/
            headers:
              - name: ":authority"
                exact_match: example.com:8500

After the fix:

In the non-TLS scenario there are no changes to what gets generated: as in v1.Y, we have no SNI to differentiate FilterChains, so a virtual_host per Host is generated along with route header matching.

In the TLS scenario we now parse the hostname to determine the SNI. SNI does not understand ports, which is why the configuration above was effectively dead: no connections would ever be made to that FilterChain.

So, example.com:8500 would become:

  • sni=example.com
  • hostname=example.com:8500
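This split mirrors the `rsplit(":", 1)` approach visible in the diff later in this PR; a minimal sketch (hypothetical helper name):

```python
# Sketch of the hostname split described above: the SNI side drops the
# port (SNI has no notion of ports), while the virtual_host domain keeps
# the full value for :authority matching. Mirrors the rsplit(":", 1)
# pattern from the PR diff; the function name here is hypothetical.

def split_hostname(hostname: str) -> tuple[str, str]:
    sni = hostname.rsplit(":", 1)[0] if hostname else "*"
    return sni, hostname

assert split_hostname("example.com:8500") == ("example.com", "example.com:8500")
assert split_hostname("example.com") == ("example.com", "example.com")
```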

We now create a FilterChain per SNI value and add a virtual host for the full hostname, which produces:

FilterChains:
 - FilterChainMatching:
    server_names: ["example.com"]
    transport_protocol: tls
   FilterChain:
    virtual_hosts:
      - domains: ["example.com:8500"]
        routes:
          - prefix: /backend/
            headers:
              - name: ":authority"
                exact_match: example.com:8500

If a second Host with hostname=example.com is provided, then we have a conflict with the 1 FilterChain to 1 virtual_host design used in the TLS scenario. Therefore, we attempt to merge these into a single FilterChain with multiple virtual_hosts. We can only do this merge when the same TLSContext is used: otherwise we would not know which TLS settings to apply, because those are set at the FilterChain level, not the virtual_host level. This also means that Host.tls is not supported in this scenario, because internally it is translated to an implicit TLSContext and would have the same issue of two Hosts being in conflict.

apiVersion: getambassador.io/v3alpha1
kind: TLSContext
metadata:
  name: my-tls-context
  namespace: default
spec:
  secret: tls-cert
  hosts: ["*.local", "*.local:8500"]
---
apiVersion: getambassador.io/v3alpha1
kind: Host
metadata:
  name: basic-host-2
  namespace: default
spec:
  hostname: 'example.com'
  tlsSecret:
    name: tls-cert
  tlsContext: 
    name: my-tls-context
---
apiVersion: getambassador.io/v3alpha1
kind: Host
metadata:
  name: basic-host
  namespace: default
spec:
  hostname: 'example.com:8500'
  tlsSecret:
    name: tls-cert
  tlsContext: 
    name: my-tls-context
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: quote-backend
  namespace: default
spec:
  hostname: "example.com:8500"
  prefix: /backend/
  service: quote
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: quote-backend-2
  namespace: default
spec:
  hostname: "example.com"
  prefix: /backend2/
  service: quote

In that scenario we would produce the following:

FilterChains:
 - FilterChainMatching:
    server_names: ["example.com"]
    transport_protocol: tls
   FilterChain:
    virtual_hosts:
      - domains: ["example.com:8500"]
        routes:
          - prefix: /backend/
            headers:
              - name: ":authority"
                exact_match: example.com:8500
      - domains: ["example.com"]
        routes:
          - prefix: /backend2/
            headers:
              - name: ":authority"
                exact_match: example.com

Related Issues

I might be missing some but these capture it mostly.

Testing

CI is green here, and the change was pulled into Edge Stack as well, as seen here: https://github.com/datawire/apro/pull/3228

I have added some unit tests similar to the golden tests we had a while back. They allow a much faster dev loop, but they may suffer from the same challenge we had before: any change to the default generated config will require them to be updated. I have left them in place for now, but we may want to revisit this if it becomes a pain.

Checklist

  • Does my change need to be backported to a previous release?
  • I made sure to update CHANGELOG.md.
  • This is unlikely to impact how Ambassador performs at scale.
  • My change is adequately tested.
  • I updated DEVELOPING.md with any special dev tricks I had to use to work on this code efficiently.
  • The changes in this PR have been reviewed for security concerns and adherence to security best practices.

When unit testing diagd, most of the tests use the `econf_compile` helper,
which by default asserts that no aconf errors existed. This is good,
but it requires attaching a debugger to grab the errors.

This prints out the object so that when the test is run outside of
an IDE with a debugger, the errors can be seen. Although it's not neatly
diffed/printed due to being a nested structure. There is an issue to
improve this, but it looks to still be open: pytest-dev/pytest#1531.
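The idea in this commit can be sketched roughly as follows. This is a hedged illustration: `assert_no_errors` and its `errors` dict are stand-ins for the real aconf error structure checked inside emissary's `econf_compile` helper, not its actual API.

```python
# Hedged sketch of the commit's idea: print the collected errors before
# asserting, so a plain pytest run shows them without a debugger.
# `assert_no_errors` / `errors` are hypothetical stand-ins for the real
# aconf error structure used by econf_compile.
import pprint

def assert_no_errors(errors: dict) -> None:
    if errors:
        # Nested structure, so the output is not neatly diffed -- but at
        # least it is visible in the test log.
        pprint.pprint(errors)
    assert not errors, f"unexpected aconf errors: {list(errors)}"

assert_no_errors({})  # no errors: passes silently
```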

Signed-off-by: Lance Austin <laustin@datawire.io>
@LanceEa LanceEa force-pushed the laustin/fix-hostname-port-issue branch 3 times, most recently from 79c3b19 to 692b558 Compare January 27, 2023 12:35
When trying to expose a Host with a hostname that includes a port, such
as 'example.com:8500', clients receive 404 NRs due to the way Envoy was
configured.

In v1.14, we used to group everything under a wildcard virtual_host
domain "*" and did not include sni matching  on the Filter Chain. This
allowed that configuration to work but had the downside of causing
large memory usage and slower route matching due to lumping all
routes into a single virtual host.

In the v2.Y and v3.Y series, this was addressed by creating separate
Filter Chains for each host. A non-tls Host would get a single Filter
Chain with multiple virtual_hosts per Host. A Host with TLS would
produce a 1-1 FilterChain and Virtualhost. This works fine in most
cases when downstream clients are connecting on standard ports
(80/443), but when a client needs to connect on something like
example.com:8500 this would effectively generate a Filter Chain
that could never match an incoming request.

In the non-tls scenario there are no changes to what gets generated. In
the TLS scenario we now parse the hostname into two entities: the SNI and
the virtual_host.domain. So, example.com:8500 would have an SNI of
example.com and a virtual_host.domain of example.com:8500.
We create a Filter Chain for the sni value and add a virtual host for the
domain value. If a second Host with just `example.com` is provided as
well then we will attempt to merge these into a single Filter Chain with
multiple virtual_host. We can only do this when the same TLSContext
is used because we wouldn't know which attributes to take from which
host since all the transport_socket settings are at the Filter Chain level.

This restores existing behavior in a backwards-compatible way and
doesn't try to solve the Developer Experience (DX) issues with the way
the Host is currently designed.

Signed-off-by: Lance Austin <laustin@datawire.io>
@LanceEa LanceEa force-pushed the laustin/fix-hostname-port-issue branch from 692b558 to 4847a15 Compare January 27, 2023 13:34
@LanceEa LanceEa marked this pull request as ready for review January 27, 2023 17:10
Alice-Lilith
Alice-Lilith previously approved these changes Jan 30, 2023
Member

@Alice-Lilith Alice-Lilith left a comment


Question about the changelog but otherwise no concerns

docs/releaseNotes.yml
@@ -391,17 +394,17 @@ def save_context(self, ir: "IR", ctx_name: str, tls_ss: SavedSecret, tls_name: s
# TLS config is good, let's make sure the hosts line up too.
context_hosts = ctx.get("hosts")

# XXX WTF? self.name is not OK as a hostname! Leaving this for the moment, but it's
# almost certainly getting shredded before 2.0 GAs.
host_hosts = [self.name]

What are the ramifications of getting rid of name? Basically, why can we do it now when it wasn't done before?

Contributor Author

As we discussed offline, I dug into this deeper. Although we don't document this anywhere, there is a potential that a user could ignore our docs and interpret TLSContext.hosts as needing to match Host.name.

Although it is unlikely and not our intended behavior, I will add it back since it is not crucial to the fix in this PR. I will capture it along with the other deprecated TLS features I outlined, and we can plan for its removal in the future.

Contributor Author

Here is the overall issue on this topic:
#4826

I mention this there and note that it is deprecated. I will have a fixup commit here shortly that removes some dead code where we check self.hostname or self.name, because we default self.hostname to "*" if the user doesn't provide it.

Member

FTR I'm the one who wrote that comment. self.name is not guaranteed to be something that's actually sane for a DNS match, IIRC, so I'm in favor of it going away. However, it used to be important that host_hosts never be empty; might be worth a test.

Member

(Newer folks, if you see a comment that starts with XXX, it's almost certainly something I wrote. 🙂)

other = next(iter(self.hosts.values()))
other_type = "TCPMapping" if isinstance(other, IRTCPMappingGroup) else "Host"
tcpmapping.post_error(
"TCPMapping {tcpmapping.name}: discarding because it conflicts with {other_type} {other.name}"
f"TCPMapping {tcpmapping.name}: discarding because it conflicts with {other_type} {other.name}"

So we're not going to allow a Host resource with hostname example.com:5000 and a TCPMapping with host example.com, regardless of the port specified on the TCPMapping?

What if you specified a different port on the TCPMapping? Theoretically you should be able to put them on different filter chains, with filter chain matching based on SNI & port.

Or is that case handled elsewhere? I believe a new listener is created if none exists with the specified TCPMapping port.

Contributor Author

So we're not going to allow a Host resource with hostname example.com:5000 and a TCPMapping with host example.com, regardless of the port specified on the TCPMapping?

Correct. In v3listeners.finalize we process TCPMappings first, and their filter chains get precedence on a Listener. Luke left a comment in that function that outlines it well; the reasoning for this has to do with providing sensible upgrade paths.

Since TCPMapping gets precedence, in the scenario you outline they are both trying to generate a FilterChain with the same FilterChainMatch (TLS & SNI example.com), which Envoy doesn't allow. Technically, you could differentiate these filter chains by adding alpn application_protocol: [h2, http/1.1] and Envoy would accept it, but Envoy itself calls out the fact that ALPN is not reliable, so I think this would cause more headaches than benefits.

What if you specified a different port on the TCPMapping? Theoretically you should be able to put them on different filter chains, with filter chain matching based on SNI & port.

Or is that case handled elsewhere? I believe a new listener is created if none exists with the specified TCPMapping port.

Correct, that would work, because the TCPMapping would attach to an existing Listener or implicitly create one behind the scenes for that address/port, so there would be no overlap.

Couple of things

  1. I think the function you commented on can be confusing, because it relies on knowing that TCPMappings take precedence in V3Listeners.finalize, and on the method chaining where V3Listener.add_chain is shared by TCP/HTTP chains and returns the V3Chain. So mentally you have to switch context depending on which code path led you here.

  2. While I was looking into this I found a bug when applying a non-tls cleartext TCPMapping to a listener that serves HTTP. I have captured it here: tcpmappings: invalid envoy configuration generated when applying non-tls TCPMapping on listener with HTTP #4825

@@ -62,7 +64,8 @@ def __init__(self, config: "V3Config", context: Optional["IRTLSContext"]) -> Non

self.context = context

This is honestly just a nit, but I keep getting confused about what context stands for and have to constantly look it up to see that it refers to a TLSContext 😆

Contributor Author

Agree... but it's fairly pervasive and would be out of scope for this PR :).

Member

a. @haq204, you're not wrong. 😐
b. A bunch of this is stuff I wrote before TLSContext existed as a named thing.
c. Sorry. 😇

self.add_chain("tcp", context, group_host).add_tcphost(irgroup)

sni = group_host.rsplit(":", 1)[0] if group_host else "*"
self.add_chain("tcp", context, group_host, group_host).add_tcphost(irgroup)

I think you meant to pass in the sni variable?

Contributor Author

yup good catch

Contributor Author

I'm actually going to push a fixup commit for this. I had originally done this for symmetry between HTTP and TCP, but TCPMapping already has host and port. The port is required, while the host is optional if the user wants to do SNI matching against a TLSContext. Therefore, for this PR I think it is out of scope and we can leave it. But it might be good to provide earlier TCPMapping validation on the host to ensure the user hasn't included a port there. That, though, is a new feature and could be done in a follow-up PR.

Signed-off-by: Lance Austin <laustin@datawire.io>
Signed-off-by: Lance Austin <laustin@datawire.io>
@LanceEa LanceEa merged commit f70b5ae into master Feb 2, 2023
@LanceEa LanceEa deleted the laustin/fix-hostname-port-issue branch February 2, 2023 18:46